What is the "Bitter Lesson"?
The Bitter Lesson is a thesis introduced by Rich Sutton
Computer scientist who is considered one of the founders of the field reinforcement learning. He is known for popularizing the “scaling hypothesis” as well as the “bitter lesson”.
Historically, AI research has mostly designed systems to use a fixed amount of computing power, improving their performance by applying domain-specific human knowledge. In theory, this approach is compatible with also improving performance by scaling (increasing computing power), but in practice, the complications introduced by leveraging human ingenuity make it harder to also leverage computation. Available computing power keeps growing steadily in accordance with Moore’s law, and past trends suggest that leveraging this growth is what increases performance in the long run.
Some fields that Sutton cites as examples of the Bitter Lesson are:
-
Games: In chess, Deep Blue beat world champion Garry Kasparov by using enormous computing power (for the time) to search deeply through the tree of possible moves. Similarly, in Go, AlphaGo beat world champion Lee Sedol using deep learning plus Monte Carlo tree search to find its moves, instead of using human-engineered Go techniques. Less than a year later, AlphaZero beat AlphaGo using self-play, without using human-generated Go data at all. None of these advances relied on humans coding in a deeper strategic understanding
-
Vision: Early computer vision methods worked with human-engineered features
and convolution kernels to perform image recognition tasks, but it was later found that leveraging more computeFeatureView full definitionA feature of a region of input space that corresponds to a useful pattern. For example, in an image detector, a set of neurons that detects cars might be a feature.
and letting convolutional neural nets (CNNs) learn their own features yields much better performance.ComputeView full definitionShorthand for “computing power”. It may refer to, for instance, physical infrastructure such as CPUs or GPUs that perform processing, or the amount of processing power needed to train a model.
Modern AI has learned to favor general-purpose methods of search and learning which continue to scale with increasing compute. Over the last couple of generations of transformer models, simply scaling models has been so effective that it has led OpenAI to propose scaling laws The relationship between a model’s performance and the amount of compute used to train it.